
Adding addendum namespaces config #140

Open · wants to merge 2 commits into base: master

Conversation

mansoor-sajjad
Contributor

@mansoor-sajjad mansoor-sajjad commented Nov 29, 2022

👋

Problem statement:

We want to import custom data into the Pelias index in Elasticsearch and be able to filter query results on those custom data fields, just as we can already filter results by sources and layers, for example.

One solution is to fork pelias-schema and extend the index mapping with whatever fields we want. But then we would also need to fork the other Pelias services and update them to use those custom data fields. We would rather stay connected with the open source community than start maintaining forks of Pelias services ourselves, so we implemented a solution by adding some configuration options that support our use case and will probably help others too.

Findings:

Pelias schema has the addendum field, into which we can import whatever data we want. That is fine, but there are two problems. Firstly, it is not indexed:

    addendum: {
      path_match: 'addendum.*',
      match_mapping_type: 'string',
      mapping: {
        type: 'keyword',
        index: false,
        doc_values: false
      }
    }

and secondly, we can only store JSON strings in addendum fields when creating data through pelias-model.

Solution:

In order to filter results on addendum fields, we firstly need to make the addendum field indexed, and secondly we need to be able to store data in other searchable types such as string or array.

The following changes implement the solution described above. We have taken special care to preserve backwards compatibility.

pelias-schema Link to pull request

First of all, we have added the configuration option indexAddendumField in pelias-schema, which optionally indexes the addendum namespaces. The default value is false, to keep the schema backwards compatible. We have not added this option to pelias-config, as it is only used in pelias-schema.
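For illustration only, here is a minimal sketch of how such a flag could toggle the `index` and `doc_values` settings of the dynamic template shown earlier. `buildAddendumTemplate` is a hypothetical helper name, not the actual pelias-schema code:

```javascript
// Sketch: build the addendum dynamic template, optionally indexed.
// `buildAddendumTemplate` is a hypothetical name for illustration.
function buildAddendumTemplate(indexAddendumField = false) {
  return {
    addendum: {
      path_match: 'addendum.*',
      match_mapping_type: 'string',
      mapping: {
        type: 'keyword',
        // only searchable when explicitly opted in via configuration
        index: indexAddendumField,
        doc_values: indexAddendumField
      }
    }
  };
}
```

With the default `false`, the generated mapping is identical to the existing non-searchable template.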

pelias-config (This pull request)

We have added the configuration option addendum_namespaces to pelias-config, with an empty object as the default value.

We can configure the names of the namespaces we want to filter on, and the type of data each namespace will hold. The type configuration is used to sanitize and validate data for the configured addendum_namespaces during import and querying.

The configuration will look like the following:

"addendum_namespaces": {
    "tariff_zone_ids": {
      "type": "array"
    },
    "tariff_zone_authority": {
      "type": "string"
    }
  }

Here, tariff_zone_ids and tariff_zone_authority will be imported as children of the addendum field, with the types array and string respectively.

We have added configuration validation with a Joi schema in pelias-config to make sure the configuration is correct.

We have restricted the configurable types to string, number, boolean and array, because the addendum field in the Elasticsearch mapping in pelias-schema is configured as the keyword type, which cannot store objects.
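The rules that the Joi schema enforces can be sketched in plain JavaScript as follows. This is an illustrative re-statement of the constraints, not the actual pelias-config code; `validateAddendumNamespaces` is a hypothetical name:

```javascript
// Sketch of the validation rules: each configured namespace must be an
// object whose `type` is one of the four allowed scalar/array types.
const ALLOWED_TYPES = ['string', 'number', 'boolean', 'array'];

function validateAddendumNamespaces(namespaces) {
  return Object.entries(namespaces).every(([name, def]) =>
    def !== null &&
    typeof def === 'object' &&
    ALLOWED_TYPES.includes(def.type)
  );
}
```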

pelias-model Link to pull request

We have updated the setAddendum function in Document.js to validate each configured addendum namespace against its configured type, rather than always as an object, as it was before this PR.

In addition, all configured addendum_namespaces will be stored in their configured data types.

All non-configured namespaces will still be validated as objects and stored as JSON strings, as before this PR. This means that if addendum_namespaces is not configured at all, nothing changes, so backwards compatibility is preserved.
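That split between configured and non-configured namespaces can be sketched as follows. This is an illustrative sketch of the described behaviour, not the actual pelias-model code; `setAddendumValue` is a hypothetical name:

```javascript
// Sketch: configured namespaces keep their configured type, everything
// else falls back to the pre-existing object-to-JSON-string behaviour.
function setAddendumValue(namespaces, name, value) {
  const def = namespaces[name];
  if (def) {
    const actual = Array.isArray(value) ? 'array' : typeof value;
    if (actual !== def.type) {
      throw new Error(`addendum ${name} must be of type ${def.type}`);
    }
    return value; // stored as-is, in its configured type
  }
  // non-configured namespaces: validated as object, stored as JSON string
  if (typeof value !== 'object' || value === null) {
    throw new Error(`addendum ${name} must be an object`);
  }
  return JSON.stringify(value);
}
```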

We have also added new unit tests for the setAddendum function to cover the new use cases.

pelias-api Link to pull request

The goal here is to be able to filter results on the configured addendum_namespaces directly, without explicitly mentioning the relationship to the addendum field, and to get each value back as a separate property, i.e. outside the addendum field.

For example:
If we have configured the following in pelias.json:

"addendum_namespaces": {
    "tariff_zone_ids": {
      "type": "array"
    }
  }

then we will be able to filter results on the configured namespace as follows:

http://localhost:3100/v1/autocomplete?text=Langholen&tariff_zone_ids=INN:FareZone:94,INN:FareZone:78

Or, if we have multiple addendum_namespaces configured, like the following:

"addendum_namespaces": {
    "tariff_zone_ids": {
      "type": "array"
    },
    "public_id": {
      "type": "number"
    }
  }

then we will be able to filter results on multiple configured namespaces at once:

http://localhost:3100/v1/search?text=Langholen&tariff_zone_ids=INN:FareZone:94,INN:FareZone:78&public_id=123

We have made the following changes in pelias-api:

Sanitization

We have updated sanitizeAll.js to find the configured addendum_namespaces among the received parameters and qualify them as goodParameters.

We have added a new _addendum.js sanitizer, which sanitizes the values provided for those configured goodParameters based on the configured types.

We have updated the autocomplete, search, reverse, structured_geocoding and nearby sanitizers to use the new _addendum sanitizer.

We have added unit tests covering these changes.
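A type-based sanitizer for these parameters could look roughly like the following. This is a sketch under assumed names (`sanitizeAddendumParam` is hypothetical), not the actual _addendum.js code:

```javascript
// Sketch: coerce a raw query-string value into the configured type.
function sanitizeAddendumParam(def, raw) {
  switch (def.type) {
    case 'array':
      // comma-separated list, e.g. ?tariff_zone_ids=A,B
      return raw.split(',').map(s => s.trim()).filter(s => s.length > 0);
    case 'number': {
      const n = Number(raw);
      if (Number.isNaN(n)) { throw new Error('expected a number'); }
      return n;
    }
    case 'boolean':
      return raw === 'true';
    default: // 'string'
      return raw;
  }
}
```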

Query

We have added a new query filter, addendum_namespace_filter.js, which creates the filter query based on the types configured in addendum_namespaces.

We have updated the autocomplete, search, reverse and structured_geocoding queries to use the new addendum_namespace_filter.
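The essence of that filter, choosing a terms clause for array types and a term clause for scalar types, can be sketched as follows. The helper name is hypothetical, not the actual addendum_namespace_filter.js code:

```javascript
// Sketch: build an ES filter clause for one configured addendum namespace.
function addendumNamespaceFilter(name, def, value) {
  const field = `addendum.${name}`;
  return def.type === 'array'
    ? { terms: { [field]: value } } // match any of the given values
    : { term: { [field]: value } }; // exact match on a single value
}
```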

Search result

We have updated geojsonify.js to include the configured addendum_namespaces in the response as separate properties, not under the addendum field. All non-configured addendum namespaces are still returned as children of addendum in the result, as they were prior to this PR.

For example:
Let's say we have the following configuration:

"addendum_namespaces": {
    "tariff_zone_ids": {
      "type": "array"
    },
    "tariff_zone_authorities": {
      "type": "array"
    },
    "public_id": {
      "type": "number"
    }
  }

and we have imported the following addendum data into the Pelias index:

...
"addendum": {
  "tariff_zone_ids": [ "INN:FareZone:94", "INN:FareZone:78" ],
  "tariff_zone_authorities": [ "TAR" ],
  "public_id": 123,
  "description": "{\"nor\":\"something\"}"
},
...

Note that the description namespace is not configured, so it is kept as a JSON string.

Given the following request:

http://localhost:3100/v1/autocomplete?text=Langholen&tariff_zone_ids=INN:FareZone:94,INN:FareZone:78&public_id=123&tariff_zone_authorities=TAR

pelias-api will add the following filters in the ES query:

'filter': [
  {
    'terms': {
      'addendum.tariff_zone_ids': [
        'INN:FareZone:94', 'INN:FareZone:78'
      ]
    }
  },
  {
    'terms': {
      'addendum.tariff_zone_authorities': [
        'TAR'
      ]
    }
  },
  {
    'term': {
      'addendum.public_id': '123'
    }
  }
]

Note that terms is used for the array-typed addendum_namespaces and term for the scalar (here number-typed) one.

The result will look something like the following:

...
"addendum": {
  "description": {
    "nor": "something"
  }
},
"tariff_zone_ids": [ "INN:FareZone:94", "INN:FareZone:78" ],
"tariff_zone_authorities": [ "TAR" ],
"public_id": 123,
...

Note that the configured namespaces are outside the addendum field and the non-configured namespaces are inside the addendum field in the response. This means that if addendum_namespaces is not configured at all, the response is the same as prior to this PR, so backwards compatibility is preserved.

@orangejulius
Member

Hi @mansoor-sajjad,
Thanks for the nice PRs, but next time, before building a huge feature, feel free to open an issue discussing the problem first.

It would save you a ton of time because unfortunately there's no way we can merge functionality like this. The addendum fields were always intended to be non-searchable additional metadata, so making them searchable would be really messy. In addition there would be massive performance implications to changing how existing addendum data is indexed, and we really try to avoid adding extra parameters to the API.

You can probably achieve much of what you want with the category field and the categories= API parameter.

If not let us know and perhaps we can come up with something.

@mansoor-sajjad
Contributor Author

mansoor-sajjad commented Nov 30, 2022

Hei @orangejulius,

We would have opened an issue to discuss the problem first, but we found that this problem has been discussed in multiple issues before, with either no solution or the suggestion to use the category field. So this time we decided to come up with a solution instead of creating yet another issue.

All of the concerns you have mentioned are understandable and we had them in mind before implementing this feature.
And we tried to address these concerns through the configuration.

Concern # 1:

The addendum fields were always intended to be non-searchable additional metadata

It will remain non-searchable additional metadata if nothing is configured.
The addendum field will not be indexed without explicitly enabling it via the configuration option we added, i.e. setting schema.indexAddendumField to true. This option is not required.

Concern # 2:

Making them searchable would be really messy

We have tried to handle exactly that mess in these PRs, making sure that backwards compatibility is preserved and that no one is affected by this change without explicit configuration.
Every change is unit tested.

Concern # 3:

There would be massive performance implications to changing how existing addendum data is indexed

This is a choice the user of this feature has to make.
Indexing the extra data will cost some import performance, but the user has to balance the need for this feature against the performance impact.
The feature is not available by default; users have to opt in via configuration after making that calculated decision.

Concern # 4:

We really try to avoid adding extra parameters to the API

Again, the extra parameters are not available by default.
Maybe we could limit the number of configurable addendum namespaces?

Concern # 5:

You can probably achieve much of what you want with the category field and the categories= API parameter.

We are already using categories.
But the problem with categories is that we cannot group different types of data in order to identify them later.
Category is only a keyword field. If we, for example, import IDs for different types of data into category, we can filter on them, but we cannot identify which ID belongs to which type of data in the response.

@testower

testower commented Dec 6, 2022

In the current version (v1) of our geocoder we have forked the api project and inject data manually to solve this. This has become difficult to maintain, for obvious reasons. Our current effort is to solve this within the confines of the upstream projects, with contributions to those projects where necessary.

Using "categories" as a dump site for custom but indexable metadata is a suboptimal hack. I would be very interested to hear how this could be solved in other, cleaner ways.

@missinglink
Member

Hi @mansoor-sajjad, thanks for putting this together.

As Julian mentioned, the addendum field was originally designed to be non-searchable, so I'm trying to better understand your use case in order to decide whether reversing that decision is a good idea, or whether there may be an alternative solution to your problem.

If we for example import IDs of different type of data in the category, we can filter on them, but we cannot identify which ID belongs to which data in the response.

Can you please explain this a little more so I understand correctly? My understanding is that adding INN:FareZone:94 as a category to a document would allow filtering of results and would also return the string INN:FareZone:94 in the feature properties?

@mansoor-sajjad
Contributor Author

mansoor-sajjad commented Dec 8, 2022

Hi @missinglink,

Thanks for the response.

I want to take this opportunity to introduce ourselves also:

We are Entur.

Entur is owned by the Norwegian Ministry of Transport and is intended to offer basic public transport services within travel planning and ticketing in Norway.

Entur connects public transport in Norway, and collaborates with the actors in the public transport sector to achieve simple, sustainable travels.

Among the other open source services we provide, one of the most important is Geocoding, which is backed by Pelias. (Thanks for building it, making it available for use, and maintaining it like superheroes.)

We are importing different kinds of data into ES via Pelias: registered stop places in Norway, Norwegian addresses, points of interest, etc.

As @testower mentioned in his comment, in the current version (v1) of our geocoder we have forked the pelias-api project and inject data manually into Elasticsearch in order to import and query the custom data.

The category field:

Can you please explain this a little more so I understand correctly? My understanding is that adding INN:FareZone:94 as a category to a document would allow filtering of results and would also return the string INN:FareZone:94 in the feature properties?

We are using the category field extensively for storing different identifiers for different types of data, whether stop places, addresses or POIs. We can filter on it and get the data back in the feature properties. Clients identify the identifiers in the category field based on the source and layer, which is nice.

We have some additional data, like tariff zones and fare zones, that we would like to filter results on, in addition to the identifiers we store in the category field.

We could store the tariff zone and fare zone identifiers in the category field too, but then we have two problems:

First Problem:
The category field is just a simple ES keyword field.

If we query and filter on some tariff zone ID, we get the category array back in the feature properties, but it will then contain the tariff zone identifiers, fare zone identifiers and the other data identifiers we store in the category field.
It is impossible to tell which ID belongs to which data type in that array.
It is like mixing different kinds of IDs in the same database column.
For example, the following array is a mix of data identifiers; what 678 is and what 345 is cannot be determined:
[123, 345, 678]

Second Problem:
The categories filter is an OR filter, not an AND filter.

For example, we store the transport type ID in the category field for stop places. Suppose we also store the tariff zone there, and we now want the records with a given transport type and a given tariff zone. We would query with the following parameters:

&categories=someTransportTypeId,someTariffZoneId

This is interpreted as someTransportTypeId OR someTariffZoneId, but we need someTransportTypeId AND someTariffZoneId. That is, we want only those records that have both the given transport type ID and the given tariff zone ID, not one or the other.

@mansoor-sajjad
Contributor Author

mansoor-sajjad commented Dec 14, 2022

Third Problem:
The category array is not returned in the feature properties unless a categories filter is used.
We have to include some categories filter in the query to get category back in the response.
Ref: https://github.com/pelias/api/blob/d1163de90bf30f300f10568d2d521f21f046ffd6/helper/geojsonify_place_details.js#L60

@testower

testower commented Dec 15, 2022

To give a little more detail on our use case: we have around 60k public transport stop places from Norway's National Stop Registry in our geocoder. These stop places have modality information (e.g. transport mode bus and submode localBus). In addition, they belong to one or more tariff zones for ticketing purposes.

A consumer may want to search within all bus and tram stops in a certain tariff zone. This does not work when collapsed into the categories filter, categories=bus,tram,RUT:TariffZone:1, because the filter is an OR filter, so we get all stops that are a bus stop, or a tram stop, or in the specified tariff zone.

Instead, what we actually need is all bus or tram stops that are also in the specified tariff zones; in other words, the filter should be OR within each filter type, but AND across filter types.

We have other similar use cases within other types of entries in our geocoder.

So, as you can see, solving this with the categories filter is not possible, which is why we have proposed the ability to add custom indexable data fields.

Would you be open to discussing a solution that does not involve repurposing the addendum field, but instead adds a new concept allowing custom indexable data? In essence, a structured version of the existing categories.

@missinglink
Member

missinglink commented Dec 15, 2022

A consumer may want to search within all bus and tram stops within a certain tariff zone. This will not work if collapsed into the categories filter: categories=bus,tram,RUT:TariffZone:1, because the filter is an OR filter, so we will get all stops which are a bus or tram stop or in the specified tariff zone.

Isn't it possible to achieve this by changing the way you define your categories?

I.e., if the bus stop and the tram stop are both given the category TariffZone1, then you can use ?categories=TariffZone1 to retrieve both, without including any stops from other tariff zones.


It will be impossible to identify which id belongs to which data type in the category array

This can also be resolved by changing the way you define your categories.

I.e., rather than only specifying an integer, prefix the 'type', such as tram_stop:1.


Category array will not be returned in the response in feature properties, if we don't use the categories filter.
We have to use some categories filter in query to get the category in response.

It's not documented, but you can specify an empty list of categories, which will return the categories in the response:
?categories&text=example

@missinglink
Member

missinglink commented Dec 15, 2022

Second Problem: Categories filter is a OR filter not an AND filter.

For example, we are storing the transport type id in category field for stop places. If we store the tariffzone in the category field also. If we now wants the data with the given transport type and the tariffzone. We will query with following parameters.

&categories=someTransportTypeId,someTariffZoneId

It will be interpreted as someTransportTypeId OR someTariffZoneId, but we need it to be someTransportTypeId AND someTariffZoneId. That is we want only those records which have both the given transport type id and the given tariffzone id. Not either one or other.

This can also be resolved by adding additional categories that are the concatenation of two disparate terms (i.e. someTransportTypeId+someTariffZoneId); these concatenated categories can then be used to gain more control over how you query.
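A minimal sketch of that concatenation idea at import time, under assumed names (`concatCategories` and the `+` separator are illustrative choices, not an existing Pelias convention):

```javascript
// Sketch: emit one extra category per (transport type, tariff zone) pair,
// so AND-style filtering works with the OR-only categories filter.
function concatCategories(transportTypeIds, tariffZoneIds) {
  const out = [];
  for (const t of transportTypeIds) {
    for (const z of tariffZoneIds) {
      out.push(`${t}+${z}`);
    }
  }
  return out;
}
```

Querying ?categories=bus+RUT:TariffZone:1 would then match only documents carrying that combined category.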
